Julia is an open source programming language created at MIT in 2012. Julia combines many of the advantages of the data science stack into one programming language:
- Julia has statistical capabilities comparable to R's.
- Julia is as easy to pick up as a general-purpose language like Python.
- Julia can call 33 (and counting!) data visualization packages from Python and R, including Matplotlib.
- Julia can handle scientific computing like Mathematica.
- Julia supports parallel computing natively, which can make it faster than Python.
- Julia can call Python, C, and Fortran libraries.
- Julia is compiled just in time at run time, which makes it faster than interpreted Python.
All of these advantages mean Julia has the potential to serve as a "one stop shop" for data scientists looking to test machine learning models and then deploy them into production.
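As a quick illustration of the just-in-time compilation mentioned above, the first call to a Julia function includes one-time compile cost, while later calls run already-compiled native code. A minimal sketch (exact timings will vary by machine):

```julia
# Define a simple function; Julia compiles a specialized
# native-code version the first time it is called.
f(x) = x^2 + 1

@time f(2.0)   # first call: includes one-time compilation
@time f(2.0)   # later calls: run the precompiled code, much faster
```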
Do I think Julia is ready for the "big time" yet? No, not quite. It is still quite hard to use Julia for the three steps of working with machine learning models: 1) build; 2) train; and 3) validate. Julia's machine learning ecosystem is not fully developed; however, the potential is there. Remember that Julia only reached its official 1.0 release in 2018, so it has only recently left the "beta" stage.
Julia is a language to put on your "to watch" list, since it may become a tool of choice in the near future.
Here are a few resources to learn more about the advantages of Julia:
Bezanson, J., Karpinski, S., Shah, V., and Edelman, A. (2012). Why we created Julia. Julia. Retrieved from https://julialang.org/blog/2012/02/why-we-created-julia/.
Yegulalp, S. (2020). Julia vs. Python: Which is best for data science? InfoWorld. Retrieved from https://www.infoworld.com/article/3241107/julia-vs-python-which-is-best-for-data-science.html.
Julia (2020). Julia. Retrieved from https://julialang.org/.
You will need to download the current stable release of Julia from here: https://julialang.org/downloads/. I am running the 64-bit Windows version.
I am using Jupyter Notebook via Anaconda.
From the Julia REPL (command prompt), type the following:
using Pkg
Pkg.add("IJulia")
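Once IJulia is installed, you can also launch a Jupyter notebook with a Julia kernel directly from the REPL (a sketch assuming a standard setup; if Jupyter is not found, IJulia will offer to install it via Conda):

```julia
using IJulia
notebook()   # opens Jupyter in the browser with the Julia kernel available
```

Alternatively, start Jupyter from Anaconda as described above and select the Julia kernel.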
You will need to install the following packages from the same Julia prompt:
Pkg.add("DataFrames") #similar to pandas in Python
Pkg.add("CSV") #to read in CSV files
Pkg.add("StatsBase") #basic statistical analysis
Pkg.add("Plots") #an interface to other data visualization packages like Plotly, Gadfly, PyPlot, etc.; Julia does not have a native data visualization package
Pkg.add("StatsPlots") #another interface to data visualization packages, with statistical recipes and data frame support
Pkg.add("MLDataUtils") #data preprocessing tasks for machine learning
Pkg.add("ScikitLearn") #a Julia wrapper for Python's ever popular scikit-learn machine learning package
println("Hello World!")
Hello World!
pwd()
"C:\\Users\\micha\\OneDrive\\Desktop\\Rockhurst University\\Classes\\BIA 6303 - Predictive Models\\Module8\\code"
cd("C:\\Users\\micha\\OneDrive\\Desktop\\Rockhurst University\\Classes\\BIA 6303 - Predictive Models\\Module8\\data")
pwd()
"C:\\Users\\micha\\OneDrive\\Desktop\\Rockhurst University\\Classes\\BIA 6303 - Predictive Models\\Module8\\data"
using DataFrames
using CSV
#ENV["COLUMNS"] = 400
df = DataFrame(CSV.File("Churn_Calls.csv"))
5,000 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 1 | area_code_408 | no | no | 0 |
| 2 | AK | 36 | area_code_408 | no | yes | 30 |
| 3 | AK | 36 | area_code_415 | yes | yes | 19 |
| 4 | AK | 41 | area_code_415 | no | no | 0 |
| 5 | AK | 42 | area_code_415 | no | no | 0 |
| 6 | AK | 48 | area_code_415 | no | yes | 37 |
| 7 | AK | 50 | area_code_408 | no | no | 0 |
| 8 | AK | 51 | area_code_510 | yes | yes | 12 |
| 9 | AK | 52 | area_code_408 | no | no | 0 |
| 10 | AK | 52 | area_code_415 | no | yes | 24 |
| 11 | AK | 52 | area_code_510 | no | no | 0 |
| 12 | AK | 55 | area_code_408 | no | yes | 39 |
| 13 | AK | 58 | area_code_510 | no | no | 0 |
| 14 | AK | 59 | area_code_408 | no | no | 0 |
| 15 | AK | 59 | area_code_510 | no | no | 0 |
| 16 | AK | 61 | area_code_415 | no | yes | 15 |
| 17 | AK | 68 | area_code_415 | no | no | 0 |
| 18 | AK | 71 | area_code_510 | no | no | 0 |
| 19 | AK | 74 | area_code_415 | no | no | 0 |
| 20 | AK | 76 | area_code_415 | no | no | 0 |
| 21 | AK | 76 | area_code_415 | no | yes | 22 |
| 22 | AK | 78 | area_code_408 | no | no | 0 |
| 23 | AK | 78 | area_code_510 | no | no | 0 |
| 24 | AK | 83 | area_code_415 | no | no | 0 |
| 25 | AK | 86 | area_code_408 | no | no | 0 |
| 26 | AK | 88 | area_code_415 | no | yes | 37 |
| 27 | AK | 91 | area_code_510 | no | no | 0 |
| 28 | AK | 94 | area_code_415 | no | no | 0 |
| 29 | AK | 96 | area_code_408 | no | yes | 29 |
| 30 | AK | 97 | area_code_415 | no | yes | 24 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Now let's check some basic attributes.
first(df,10)
10 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 1 | area_code_408 | no | no | 0 |
| 2 | AK | 36 | area_code_408 | no | yes | 30 |
| 3 | AK | 36 | area_code_415 | yes | yes | 19 |
| 4 | AK | 41 | area_code_415 | no | no | 0 |
| 5 | AK | 42 | area_code_415 | no | no | 0 |
| 6 | AK | 48 | area_code_415 | no | yes | 37 |
| 7 | AK | 50 | area_code_408 | no | no | 0 |
| 8 | AK | 51 | area_code_510 | yes | yes | 12 |
| 9 | AK | 52 | area_code_408 | no | no | 0 |
| 10 | AK | 52 | area_code_415 | no | yes | 24 |
last(df,10)
10 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | WY | 157 | area_code_415 | yes | no | 0 |
| 2 | WY | 159 | area_code_415 | no | no | 0 |
| 3 | WY | 160 | area_code_408 | no | no | 0 |
| 4 | WY | 161 | area_code_415 | yes | no | 0 |
| 5 | WY | 164 | area_code_510 | no | no | 0 |
| 6 | WY | 171 | area_code_415 | no | no | 0 |
| 7 | WY | 177 | area_code_415 | no | no | 0 |
| 8 | WY | 185 | area_code_415 | yes | yes | 30 |
| 9 | WY | 215 | area_code_510 | no | no | 0 |
| 10 | WY | 225 | area_code_415 | no | no | 0 |
size(df)
(5000, 20)
summary(df)
"5000×20 DataFrame"
names(df)
20-element Vector{String}:
"state"
"account_length"
"area_code"
"international_plan"
"voice_mail_plan"
"number_vmail_messages"
"total_day_minutes"
"total_day_calls"
"total_day_charge"
"total_eve_minutes"
"total_eve_calls"
"total_eve_charge"
"total_night_minutes"
"total_night_calls"
"total_night_charge"
"total_intl_minutes"
"total_intl_calls"
"total_intl_charge"
"number_customer_service_calls"
"churn"
Julia uses 1-based indexing (instead of Python's 0-based indexing). Here we print rows 1 through 5.
df[1:5,:]
5 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 1 | area_code_408 | no | no | 0 |
| 2 | AK | 36 | area_code_408 | no | yes | 30 |
| 3 | AK | 36 | area_code_415 | yes | yes | 19 |
| 4 | AK | 41 | area_code_415 | no | no | 0 |
| 5 | AK | 42 | area_code_415 | no | no | 0 |
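The same 1-based convention applies to ordinary arrays; Julia also provides the end keyword for the last index. A small illustrative sketch:

```julia
v = [10, 20, 30]
v[1]      # 10 — first element is index 1, not 0
v[end]    # 30 — last element
v[1:2]    # [10, 20] — ranges are inclusive on both ends
```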
And here we print the first two columns of rows 1 through 5.
df[1:5, 1:2]
5 rows × 2 columns
| Row | state | account_length |
|---|---|---|
| | String3 | Int64 |
| 1 | AK | 1 |
| 2 | AK | 36 |
| 3 | AK | 36 |
| 4 | AK | 41 |
| 5 | AK | 42 |
This is another way to reference a column.
df.total_day_minutes
5000-element Vector{Float64}:
175.2
146.3
171.9
159.3
171.0
211.7
183.6
135.8
217.0
170.9
148.3
139.3
131.9
⋮
170.2
235.9
180.4
167.4
82.7
189.6
160.6
231.2
175.7
154.1
83.6
182.7
We can use the same subsetting approach as in R to print a subset.
df[df.total_day_minutes .> 300,:]
66 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 152 | area_code_510 | no | no | 0 |
| 2 | AL | 141 | area_code_510 | no | yes | 28 |
| 3 | AZ | 114 | area_code_510 | no | no | 0 |
| 4 | CO | 154 | area_code_415 | no | no | 0 |
| 5 | CT | 23 | area_code_510 | no | no | 0 |
| 6 | CT | 50 | area_code_408 | no | no | 0 |
| 7 | DC | 82 | area_code_415 | no | no | 0 |
| 8 | DC | 93 | area_code_408 | no | yes | 22 |
| 9 | DE | 129 | area_code_510 | no | no | 0 |
| 10 | FL | 100 | area_code_415 | no | no | 0 |
| 11 | FL | 113 | area_code_415 | no | no | 0 |
| 12 | IA | 44 | area_code_415 | no | no | 0 |
| 13 | IN | 48 | area_code_510 | no | no | 0 |
| 14 | KS | 70 | area_code_415 | no | no | 0 |
| 15 | KS | 92 | area_code_510 | no | no | 0 |
| 16 | KS | 126 | area_code_408 | no | no | 0 |
| 17 | KY | 75 | area_code_415 | no | no | 0 |
| 18 | LA | 67 | area_code_510 | no | no | 0 |
| 19 | MA | 117 | area_code_408 | no | no | 0 |
| 20 | MD | 62 | area_code_408 | no | no | 0 |
| 21 | MD | 93 | area_code_408 | yes | no | 0 |
| 22 | ME | 79 | area_code_510 | yes | no | 0 |
| 23 | ME | 80 | area_code_408 | no | no | 0 |
| 24 | MI | 74 | area_code_415 | no | no | 0 |
| 25 | MN | 13 | area_code_510 | no | yes | 21 |
| 26 | MN | 152 | area_code_415 | no | no | 0 |
| 27 | MO | 112 | area_code_415 | no | no | 0 |
| 28 | MS | 13 | area_code_415 | no | no | 0 |
| 29 | MS | 29 | area_code_510 | no | no | 0 |
| 30 | NC | 161 | area_code_415 | no | no | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Or we can use the filter() function:
filter(row -> row.total_day_minutes > 300, df)
66 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 152 | area_code_510 | no | no | 0 |
| 2 | AL | 141 | area_code_510 | no | yes | 28 |
| 3 | AZ | 114 | area_code_510 | no | no | 0 |
| 4 | CO | 154 | area_code_415 | no | no | 0 |
| 5 | CT | 23 | area_code_510 | no | no | 0 |
| 6 | CT | 50 | area_code_408 | no | no | 0 |
| 7 | DC | 82 | area_code_415 | no | no | 0 |
| 8 | DC | 93 | area_code_408 | no | yes | 22 |
| 9 | DE | 129 | area_code_510 | no | no | 0 |
| 10 | FL | 100 | area_code_415 | no | no | 0 |
| 11 | FL | 113 | area_code_415 | no | no | 0 |
| 12 | IA | 44 | area_code_415 | no | no | 0 |
| 13 | IN | 48 | area_code_510 | no | no | 0 |
| 14 | KS | 70 | area_code_415 | no | no | 0 |
| 15 | KS | 92 | area_code_510 | no | no | 0 |
| 16 | KS | 126 | area_code_408 | no | no | 0 |
| 17 | KY | 75 | area_code_415 | no | no | 0 |
| 18 | LA | 67 | area_code_510 | no | no | 0 |
| 19 | MA | 117 | area_code_408 | no | no | 0 |
| 20 | MD | 62 | area_code_408 | no | no | 0 |
| 21 | MD | 93 | area_code_408 | yes | no | 0 |
| 22 | ME | 79 | area_code_510 | yes | no | 0 |
| 23 | ME | 80 | area_code_408 | no | no | 0 |
| 24 | MI | 74 | area_code_415 | no | no | 0 |
| 25 | MN | 13 | area_code_510 | no | yes | 21 |
| 26 | MN | 152 | area_code_415 | no | no | 0 |
| 27 | MO | 112 | area_code_415 | no | no | 0 |
| 28 | MS | 13 | area_code_415 | no | no | 0 |
| 29 | MS | 29 | area_code_510 | no | no | 0 |
| 30 | NC | 161 | area_code_415 | no | no | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
describe(df)
20 rows × 7 columns (omitted printing of 1 columns)
| Row | variable | mean | min | median | max | nmissing |
|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 |
| 1 | state | | AK | | WY | 0 |
| 2 | account_length | 100.259 | 1 | 100.0 | 243 | 0 |
| 3 | area_code | | area_code_408 | | area_code_510 | 0 |
| 4 | international_plan | | no | | yes | 0 |
| 5 | voice_mail_plan | | no | | yes | 0 |
| 6 | number_vmail_messages | 7.7552 | 0 | 0.0 | 52 | 0 |
| 7 | total_day_minutes | 180.289 | 0.0 | 180.1 | 351.5 | 0 |
| 8 | total_day_calls | 100.029 | 0 | 100.0 | 165 | 0 |
| 9 | total_day_charge | 30.6497 | 0.0 | 30.62 | 59.76 | 0 |
| 10 | total_eve_minutes | 200.637 | 0.0 | 201.0 | 363.7 | 0 |
| 11 | total_eve_calls | 100.191 | 0 | 100.0 | 170 | 0 |
| 12 | total_eve_charge | 17.0543 | 0.0 | 17.09 | 30.91 | 0 |
| 13 | total_night_minutes | 200.392 | 0.0 | 200.4 | 395.0 | 0 |
| 14 | total_night_calls | 99.9192 | 0 | 100.0 | 175 | 0 |
| 15 | total_night_charge | 9.01773 | 0.0 | 9.02 | 17.77 | 0 |
| 16 | total_intl_minutes | 10.2618 | 0.0 | 10.3 | 20.0 | 0 |
| 17 | total_intl_calls | 4.4352 | 0 | 4.0 | 20 | 0 |
| 18 | total_intl_charge | 2.7712 | 0.0 | 2.78 | 5.4 | 0 |
| 19 | number_customer_service_calls | 1.5704 | 0 | 1.0 | 9 | 0 |
| 20 | churn | | no | | yes | 0 |
You can use the show() function to print out all columns.
show(describe(df),allcols=true)
20×7 DataFrame
 Row │ variable                       mean     min            median  max            nmissing  eltype
     │ Symbol                         Union…   Any            Union…  Any            Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ state                                   AK                     WY                    0  String3
   2 │ account_length                 100.259  1              100.0   243                   0  Int64
   3 │ area_code                               area_code_408          area_code_510         0  String15
   4 │ international_plan                      no                     yes                   0  String3
   5 │ voice_mail_plan                         no                     yes                   0  String3
   6 │ number_vmail_messages          7.7552   0              0.0     52                    0  Int64
   7 │ total_day_minutes              180.289  0.0            180.1   351.5                 0  Float64
   8 │ total_day_calls                100.029  0              100.0   165                   0  Int64
   9 │ total_day_charge               30.6497  0.0            30.62   59.76                 0  Float64
  10 │ total_eve_minutes              200.637  0.0            201.0   363.7                 0  Float64
  11 │ total_eve_calls                100.191  0              100.0   170                   0  Int64
  12 │ total_eve_charge               17.0543  0.0            17.09   30.91                 0  Float64
  13 │ total_night_minutes            200.392  0.0            200.4   395.0                 0  Float64
  14 │ total_night_calls              99.9192  0              100.0   175                   0  Int64
  15 │ total_night_charge             9.01773  0.0            9.02    17.77                 0  Float64
  16 │ total_intl_minutes             10.2618  0.0            10.3    20.0                  0  Float64
  17 │ total_intl_calls               4.4352   0              4.0     20                    0  Int64
  18 │ total_intl_charge              2.7712   0.0            2.78    5.4                   0  Float64
  19 │ number_customer_service_calls  1.5704   0              1.0     9                     0  Int64
  20 │ churn                                   no                     yes                   0  String3
You can do even more with the describe() function. In particular, you can request any of the following statistics:
- mean
- std
- min
- q25
- median
- q75
- max
- eltype
- nunique
- first
- last
- nmissing
describe(df, :min, :q25, :median, :q75, :max)
20 rows × 6 columns
| Row | variable | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|
| | Symbol | Any | Union… | Union… | Union… | Any |
| 1 | state | AK | | | | WY |
| 2 | account_length | 1 | 73.0 | 100.0 | 127.0 | 243 |
| 3 | area_code | area_code_408 | | | | area_code_510 |
| 4 | international_plan | no | | | | yes |
| 5 | voice_mail_plan | no | | | | yes |
| 6 | number_vmail_messages | 0 | 0.0 | 0.0 | 17.0 | 52 |
| 7 | total_day_minutes | 0.0 | 143.7 | 180.1 | 216.2 | 351.5 |
| 8 | total_day_calls | 0 | 87.0 | 100.0 | 113.0 | 165 |
| 9 | total_day_charge | 0.0 | 24.43 | 30.62 | 36.75 | 59.76 |
| 10 | total_eve_minutes | 0.0 | 166.375 | 201.0 | 234.1 | 363.7 |
| 11 | total_eve_calls | 0 | 87.0 | 100.0 | 114.0 | 170 |
| 12 | total_eve_charge | 0.0 | 14.14 | 17.09 | 19.9 | 30.91 |
| 13 | total_night_minutes | 0.0 | 166.9 | 200.4 | 234.7 | 395.0 |
| 14 | total_night_calls | 0 | 87.0 | 100.0 | 113.0 | 175 |
| 15 | total_night_charge | 0.0 | 7.51 | 9.02 | 10.56 | 17.77 |
| 16 | total_intl_minutes | 0.0 | 8.5 | 10.3 | 12.0 | 20.0 |
| 17 | total_intl_calls | 0 | 3.0 | 4.0 | 6.0 | 20 |
| 18 | total_intl_charge | 0.0 | 2.3 | 2.78 | 3.24 | 5.4 |
| 19 | number_customer_service_calls | 0 | 1.0 | 1.0 | 2.0 | 9 |
| 20 | churn | no | | | | yes |
using StatsBase
countmap(df[:,:state])
Dict{String3, Int64} with 51 entries:
"DC" => 88
"NH" => 95
"UT" => 112
"WV" => 158
"MN" => 125
"NY" => 114
"GA" => 83
"LA" => 82
"TX" => 116
"MS" => 99
"IA" => 69
"IL" => 88
"NM" => 91
"OK" => 90
"RI" => 99
"NJ" => 112
"KS" => 99
"ID" => 119
"OH" => 116
"HI" => 86
"MA" => 103
"VT" => 101
"MT" => 99
"MI" => 103
"MO" => 93
⋮ => ⋮
print(countmap(df[:,:state]))
Dict{String3, Int64}("DC" => 88, "NH" => 95, "UT" => 112, "WV" => 158, "MN" => 125, "NY" => 114, "GA" => 83, "LA" => 82, "TX" => 116, "MS" => 99, "IA" => 69, "IL" => 88, "NM" => 91, "OK" => 90, "RI" => 99, "NJ" => 112, "KS" => 99, "ID" => 119, "OH" => 116, "HI" => 86, "MA" => 103, "VT" => 101, "MT" => 99, "MI" => 103, "MO" => 93, "FL" => 90, "NV" => 90, "CA" => 52, "SC" => 91, "ME" => 103, "KY" => 99, "WY" => 115, "NC" => 91, "AR" => 92, "AZ" => 89, "IN" => 98, "ND" => 88, "SD" => 85, "DE" => 94, "NE" => 88, "MD" => 102, "AL" => 124, "AK" => 72, "CO" => 96, "WI" => 106, "TN" => 89, "WA" => 98, "CT" => 99, "VA" => 118, "PA" => 77, "OR" => 114)
tables = groupby(df,[:churn])
GroupedDataFrame with 2 groups based on key: churn
First Group (4293 rows): churn = "no"
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 1 | area_code_408 | no | no | 0 |
| 2 | AK | 36 | area_code_408 | no | yes | 30 |
| 3 | AK | 41 | area_code_415 | no | no | 0 |
| 4 | AK | 42 | area_code_415 | no | no | 0 |
| 5 | AK | 48 | area_code_415 | no | yes | 37 |
| 6 | AK | 50 | area_code_408 | no | no | 0 |
| 7 | AK | 51 | area_code_510 | yes | yes | 12 |
| 8 | AK | 52 | area_code_408 | no | no | 0 |
| 9 | AK | 52 | area_code_415 | no | yes | 24 |
| 10 | AK | 52 | area_code_510 | no | no | 0 |
| 11 | AK | 55 | area_code_408 | no | yes | 39 |
| 12 | AK | 58 | area_code_510 | no | no | 0 |
| 13 | AK | 59 | area_code_408 | no | no | 0 |
| 14 | AK | 59 | area_code_510 | no | no | 0 |
| 15 | AK | 61 | area_code_415 | no | yes | 15 |
| 16 | AK | 68 | area_code_415 | no | no | 0 |
| 17 | AK | 71 | area_code_510 | no | no | 0 |
| 18 | AK | 74 | area_code_415 | no | no | 0 |
| 19 | AK | 76 | area_code_415 | no | no | 0 |
| 20 | AK | 76 | area_code_415 | no | yes | 22 |
| 21 | AK | 78 | area_code_408 | no | no | 0 |
| 22 | AK | 78 | area_code_510 | no | no | 0 |
| 23 | AK | 83 | area_code_415 | no | no | 0 |
| 24 | AK | 86 | area_code_408 | no | no | 0 |
| 25 | AK | 88 | area_code_415 | no | yes | 37 |
| 26 | AK | 91 | area_code_510 | no | no | 0 |
| 27 | AK | 94 | area_code_415 | no | no | 0 |
| 28 | AK | 96 | area_code_408 | no | yes | 29 |
| 29 | AK | 97 | area_code_415 | no | yes | 24 |
| 30 | AK | 98 | area_code_415 | no | no | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
⋮
Last Group (707 rows): churn = "yes"
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | String3 | Int64 | String15 | String3 | String3 | Int64 |
| 1 | AK | 36 | area_code_415 | yes | yes | 19 |
| 2 | AK | 111 | area_code_415 | no | no | 0 |
| 3 | AK | 126 | area_code_415 | no | no | 0 |
| 4 | AK | 152 | area_code_510 | no | no | 0 |
| 5 | AK | 177 | area_code_415 | yes | no | 0 |
| 6 | AL | 25 | area_code_408 | no | no | 0 |
| 7 | AL | 26 | area_code_408 | no | no | 0 |
| 8 | AL | 55 | area_code_415 | yes | no | 0 |
| 9 | AL | 60 | area_code_408 | yes | yes | 29 |
| 10 | AL | 64 | area_code_510 | no | no | 0 |
| 11 | AL | 76 | area_code_415 | no | no | 0 |
| 12 | AL | 86 | area_code_415 | no | no | 0 |
| 13 | AL | 89 | area_code_510 | no | no | 0 |
| 14 | AL | 91 | area_code_415 | no | no | 0 |
| 15 | AL | 93 | area_code_408 | no | no | 0 |
| 16 | AL | 166 | area_code_415 | yes | no | 0 |
| 17 | AL | 172 | area_code_408 | no | no | 0 |
| 18 | AL | 197 | area_code_415 | yes | no | 0 |
| 19 | AR | 32 | area_code_415 | yes | no | 0 |
| 20 | AR | 41 | area_code_415 | no | no | 0 |
| 21 | AR | 49 | area_code_408 | yes | yes | 32 |
| 22 | AR | 54 | area_code_415 | no | no | 0 |
| 23 | AR | 76 | area_code_408 | no | no | 0 |
| 24 | AR | 90 | area_code_408 | no | no | 0 |
| 25 | AR | 98 | area_code_415 | no | no | 0 |
| 26 | AR | 99 | area_code_510 | yes | no | 0 |
| 27 | AR | 107 | area_code_415 | yes | no | 0 |
| 28 | AR | 109 | area_code_415 | no | yes | 29 |
| 29 | AR | 113 | area_code_415 | yes | no | 0 |
| 30 | AR | 115 | area_code_415 | no | no | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
combine(tables, [:total_day_charge] => mean)
2 rows × 2 columns
| Row | churn | total_day_charge_mean |
|---|---|---|
| | String3 | Float64 |
| 1 | no | 29.8775 |
| 2 | yes | 35.3384 |
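combine() also accepts multiple column => function pairs, and a second => lets you name the output column. A hedged sketch using the same grouped table (the output names here are illustrative choices):

```julia
using Statistics  # provides mean and std

# Summarize several charge columns per churn group at once.
combine(tables,
    :total_day_charge => mean => :day_mean,
    :total_day_charge => std  => :day_std,
    :total_eve_charge => mean => :eve_mean)
```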
using StatsPlots
@df df scatter(
:total_day_charge,
:total_eve_charge,
title = "Total Day Charge vs. Total Evening Charge",
xlab = "Total Day Charge",
ylab = "Total Evening Charge"
)
@df df histogram(
:total_day_charge,
bins = 20,
title = "Distribution of Total Day Charge",
xlab = "Total Day Charge",
ylab = "Frequency"
)
@df df groupedhist(:total_day_charge, group = :churn, bar_position = :dodge)
#gr(size = (600, 500))
@df df corrplot([:total_day_charge :total_eve_charge :total_night_charge], grid=false)
using ScikitLearn
@sk_import preprocessing: LabelEncoder
labelencoder = LabelEncoder()
df.state = fit_transform!(labelencoder, df.state)
df.area_code = fit_transform!(labelencoder, df.area_code)
df.international_plan = fit_transform!(labelencoder, df.international_plan)
df.voice_mail_plan = fit_transform!(labelencoder, df.voice_mail_plan)
df.churn = fit_transform!(labelencoder, df.churn)
5000-element Vector{Int64}:
0
0
1
0
0
0
0
0
0
0
0
0
0
⋮
0
0
0
1
0
0
0
0
0
0
0
0
df[1:10,:]
10 rows × 20 columns (omitted printing of 14 columns)
| Row | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages |
|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 36 | 0 | 0 | 1 | 30 |
| 3 | 0 | 36 | 1 | 1 | 1 | 19 |
| 4 | 0 | 41 | 1 | 0 | 0 | 0 |
| 5 | 0 | 42 | 1 | 0 | 0 | 0 |
| 6 | 0 | 48 | 1 | 0 | 1 | 37 |
| 7 | 0 | 50 | 0 | 0 | 0 | 0 |
| 8 | 0 | 51 | 2 | 1 | 1 | 12 |
| 9 | 0 | 52 | 0 | 0 | 0 | 0 |
| 10 | 0 | 52 | 1 | 0 | 1 | 24 |
countmap(df[:,:churn])
Dict{Int64, Int64} with 2 entries:
0 => 4293
1 => 707
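With the target now numeric, the MLDataUtils package installed earlier can handle the train/test split before modeling. A minimal sketch (the 70/30 ratio and the random seed are illustrative assumptions):

```julia
using MLDataUtils
using Random

Random.seed!(123)                              # reproducible shuffle
idx = shuffleobs(collect(1:nrow(df)))          # shuffled row indices
train_idx, test_idx = splitobs(idx, at = 0.7)  # 70% train, 30% test

train = df[collect(train_idx), :]
test  = df[collect(test_idx), :]
```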
Bezanson, J., Karpinski, S., Shah, V., and Edelman, A. (2012). Why we created Julia. Julia. Retrieved from https://julialang.org/blog/2012/02/why-we-created-julia/.
Breloff, T. (2020). StatsPlots documentation. GitHub. Retrieved from https://github.com/JuliaPlots/StatsPlots.jl.
England, A. (2018). Tutorial: Tuning and fitting machine learning models with Julia. LinkedIn. Retrieved from https://www.linkedin.com/pulse/tutorial-tuning-fitting-machine-learning-models-julia-england-ph-d/.
Julia (2020). Julia. Retrieved from https://julialang.org/.
microgold (2018). Simple tools for train test split. Julia Discourse. Retrieved from https://discourse.julialang.org/t/simple-tool-for-train-test-split/473.
Yegulalp, S. (2020). Julia vs. Python: Which is best for data science? InfoWorld. Retrieved from https://www.infoworld.com/article/3241107/julia-vs-python-which-is-best-for-data-science.html.